Abstract

The battle between email service providers and senders 
of mass unsolicited emails (Spam) continues to gain trac- 
tion. Vast numbers of Spam emails are sent mainly 
from automatic botnets distributed over the world. One 
method for mitigating Spam in a computationally ef- 
ficient manner is fast and accurate blacklisting of the 
senders. In this work we propose a new sender reputation
mechanism that is based on an aggregated historical
data set which encodes the behavior of mail transfer
agents over time. A historical data set is created from
labeled logs of received emails. We use machine learn- 
ing algorithms to build a model that predicts the spam- 
mingness of mail transfer agents in the near future. The 
proposed mechanism is targeted mainly at large enter- 
prises and email service providers and can be used for 
updating both the black and the white lists. We evaluate 
the proposed mechanism using 9.5M anonymized log en- 
tries obtained from the biggest Internet service provider 
in Europe. Experiments show that the proposed method
detects more than 94% of the Spam emails that escaped
the blacklist (i.e., TPR), with fewer than 0.5%
false alarms. The proposed method is therefore considerably
more effective than previously reported reputation
mechanisms that rely on email logs. In addition, when used
for updating both the black and white lists, the proposed
method eliminated the need for automatic content
inspection of 4 out of 5 incoming emails, which
dramatically reduced the computational load of filtering.

1 Introduction 

Surveys show that in recent years 76% to 90% of all 
email traffic can be considered abusive (251 [D - A major 
portion of those billions of Spam emails annually are au- 
tomatically produced by botnets llSlEl- Bots create and 
exploit free web-mail accounts or deliver Spam emails
directly to victim mailboxes by exploiting the computational
power and network bandwidth of their hosts and
sometimes even user credentials. Many Spam mitigation
methods are used by email service providers and
organizations to protect the mailboxes of their customers and
employees, respectively. There are three main approaches
to Spam mitigation: content-based filtering (CBF), real-time
blacklisting or DNS-based blacklists, and sender
reputation mechanisms (SRM). All three approaches are
briefly described in Section 2.

While CBF methods are considered the most accurate Spam
mitigation methods, they are also the most computationally
intensive and are sometimes considered privacy
infringing. In order to speed up the filtering process,
organizations maintain blacklists of repeated Spam senders
[28] [27] [26]. Those blacklists usually complement existing
CBF methods by performing the first rough filtering
of incoming emails. Organizations that choose to main- 
tain their own blacklists gain flexibility in blocking / un- 
blocking certain addresses and the ability to respond to 
emerging Spam attacks in real-time. Flexibility in man- 
aging the blacklist is also very important for large email 
service providers that must react immediately if they re- 
ceive complaints about emails not reaching their destina- 
tion. 

Sender reputation mechanisms are used to refine 
blacklisting strategies by learning the liability of mail 
transfer agents (MTA). Most work in this research field 
focuses on extracting meaningful features from commu- 
nication patterns of MTAs and from the social network 
structure (2 [12 ESS Hill] El In Section[3]we elaborate 
on sender reputation mechanisms. Beside estimating the 
liability of MTAs, SRMs also help to respond to emerg- 
ing Spam attacks in a timely manner. By analyzing MTA 
behavior they try to identify patterns which indicate that 
the particular network address or a subnet is exploited 
by Spammers. SRMs are especially important when one 
requires a quick detection of sending pattern changes of 
mail transfer agents. For example, a white-listed address
belonging to a small bankrupt firm that once maintained
legitimate email servers is a tempting target for
exploitation by Spammers. A good SRM should be able
to detect a change in the behavior of such an address and
suggest removing it from the white list and adding it to
the black list.

In this paper we investigate methods for updating the 
reputation of sender MTAs from the perspective of a sin- 
gle email service provider. Based on an anonymized 
log of 9.5M emails, obtained from a large email ser- 
vice provider, we created a historical data set (HDS) that 
encodes the behavior of sender MTAs over time, as
described in Section 5. Machine learning (ML) algorithms
were applied to both the features extracted from the
email log and the HDS in order to create sender
behavior models for deducing the black and white lists. We
empirically compare email log (EL) based models, HDS
based models, and a commonly used blacklisting heuristic
in a setup where these SRMs are applied to emails that
have passed the provider's blacklist.

Based on the experimental results described in Section 7,
we show that an analysis of the past behavior of sender
MTAs encoded in the HDS enables filtering out as
much as 94% of the Spam emails that have passed the
blacklist while having only 0.5% false positives. Finally, we
also show that by frequently updating both the black
and white lists it is possible to spare content analysis of
roughly 82% of incoming emails. We discuss the results
and limitations of this study and propose directions for
future research in Section 8.

2 Background 

In this section we discuss in further detail three prominent
approaches to Spam mitigation and some important
previous works focusing on machine learning
and sender reputation mechanisms. Spam-blocking
approaches can be roughly divided into three main categories:
(1) content-based filtering (CBF), (2) real-time
blacklisting (RBL), and (3) sender reputation mechanisms
(SRM) (see Table 1). Next, we briefly describe
the different categories.

Content-based filtering (CBF) refers to techniques
in which an email's body, attached executables, pictures,
or other files are analyzed and processed to produce
features upon which email classification is made
E H [H [SI The email's content is related to the 
application level, the highest level of the Open Systems
Interconnection (OSI) model. Content-based features
carry a lot of useful information for classification; however,
from the perspective of an Internet Service Provider (ISP),
there are some disadvantages. First, in order to classify
the incoming emails, each email must be put through a
relatively heavy-weight content-based filter. This consumes
considerable computational resources
and thus makes CBF fairly costly when compared to other
approaches, such as real-time blacklisting, which is
discussed later. A second disadvantage of CBF is
that Spammers continuously improve their CBF-evading
techniques. For example, instead of sending plain textual
Spam emails, they use Spam images or smarter textual
content that obfuscates the unwanted content.

Real-time blacklists (RBL) are IP-based lists that 
contain IP prefixes of spamming MTAs and are regarded 
as network-level filters. Using the RBL, large firms such 
as ISPs can filter out emails originating from spamming 
IPs. The filtering is very fast since the decision to ac- 
cept or reject the email does not require receiving the 
full email (saving network resources) nor processing its 
content (saving computational resources). In order to
avoid misclassification, RBLs must be updated systematically.
For example, Spamhaus [28], SpamCop [27], and
SORBS [26] are some initiatives that keep RBL systems
updated by tracking and reporting spammer IPs. RBL
methods, however, cannot solve the Spam email problem
entirely, as spammers can escape them by, for example,
repeatedly changing their IPs by stealing local network
IPs [14], or by temporarily stealing IPs using BGP
hijacking attacks [22]. Another shortcoming of RBL is
that whenever an IP prefix is blacklisted, both spammers
and benign senders who share the same prefix might be
rejected. Benign senders can also be blocked because
of inaccurate blacklisting heuristics. In order to lower
the false-positive rates, blacklisting heuristics limit their
true positive rates, allowing many Spam mails to pass
the filter and blocking mainly repeated spammers. RBL
usually has lower accuracy than CBF, which is an acceptable
trade-off given its real-time nature and low utilization of
computational resources.
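
The prefix lookup described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular RBL service; the prefixes are reserved documentation ranges chosen here purely as an assumption, and the function name is hypothetical.

```python
import ipaddress

# Hypothetical blacklisted prefixes (reserved documentation ranges, not real spammers).
BLACKLISTED_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_blacklisted(sender_ip, prefixes=BLACKLISTED_PREFIXES):
    """Return True if the sender address falls inside any blacklisted prefix."""
    address = ipaddress.ip_address(sender_ip)
    return any(address in network for network in prefixes)
```

The decision needs only the connecting address, which is why it can be taken before the email body is received. The sketch also shows the shortcoming noted above: every host inside a listed prefix is rejected, whether or not it is benign.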

Sender reputation mechanisms (SRM) for Spam 
mitigation are methods for computing a liability score for 
email senders. The computation is usually based on in- 
formation extracted from the network or transport level, 
social network information, or other useful information 
sources. According to Alperovitch et. al Q, sender 
reputation systems should react quickly to changes in
a sender's behavioral patterns. That is, when a sender's
sending patterns take the shape of a spammer's, the sender's
reputation should decrease. If the reputation of a sender is
below the specified threshold, the system should reject the
sender's mails, at least until the sender regains some reputation
by changing its sending properties. One of the strong
advantages of sender reputation mechanisms is that they
complement and improve RBLs both in terms of detection
accuracy and response to changes.

Despite the fact that the addresses of MTAs which
were repeatedly spotted sending Spam will usually not
start sending legitimate emails all of a sudden, there
are several reasons to have accurate SRMs that react
quickly to changes in sender behavior. First, addresses
of once-legitimate email servers that are no longer used
are exploited by spammers due to their high reputation
in the databases of large email service providers. Spammers
can use the window of opportunity created by such
addresses for as long as it takes for SRMs to detect the
change in behavior of these addresses. Quick reaction
of an SRM is also required when small legitimate email
service providers are used to launch massive Spam attacks.
When such an attack is detected, operators may decide
to temporarily blacklist the provider in order to avoid
being overwhelmed by Spam emails. However, as soon as
the attack is over, the provider's addresses should be
removed from the blacklist.

Finally, the importance of SRMs will further increase 
with the prevalence of IPv6 in the Internet. The im- 
pact of botnets on the prevalence of Spam is significantly 
reduced by blacklisting all known dynamic IP ranges. 
This simple heuristic assumes that legitimate MTAs are
not hosted by end users, who are given dynamic IPs
by their Internet service providers. Currently, in order
to simplify the DHCP configuration, dynamic IPv4
addresses are arranged in contiguous ranges which are very
easy to blacklist. However, with IPv6, there will be no
need for dynamic addresses and the whole address
space is expected to become more fragmented, making it 
difficult to blacklist large IP ranges with simple heuris- 
tics. Furthermore, the extremely large address space and 
auto-configuration functionality of IPv6 are expected to 
increase the cost of Spam mitigation ifTTl . 

The current paper describes an SRM that is based on
aggregated spatio-temporal and application-level features
extracted from logs of incoming email.

3 Related Works 

In this section we discuss some sender reputation mech- 
anisms that share a similar problem domain as the pro- 
posed HDS based method. Several works have used ma- 
chine learning algorithms to compute sender reputation 
from data sets which are not based on email content anal- 
ysis. 



Ramachandran et al. [23] study a sender reputation
and blacklisting approach. They present a blacklisting
system, SpamTracker, which classifies email senders
based on their sending behavior rather than the MTA's
IP addresses. The authors assume that spammers abuse
multiple benign MTAs at different times, but their sending
patterns tend to remain mostly similar even when
shifting from one abused IP to another. SpamTracker
extracts network-level features, such as received time,
remote IP, targeted domain, and whether the email was
rejected, and uses spectral clustering algorithms to cluster
email servers. The clustering is performed on the email
destination domains. The authors reported a 10.4% TPR when using
SpamTracker on a data set of 620 Spam mails which
were missed by the organization's filter and eventually
were blacklisted in the following weeks.

Tang et al. [29] addressed the imbalanced Spam
classification task with a new version of SVM, the GSVM-BA.
The imbalance problem exists in the Spam detection
domain due to the fact that there are around 10 Spam
mails for each non-Spam mail [1]. The two main attributes
that GSVM-BA adds to SVM are granular computing,
which makes it more computationally efficient, and
a mechanism for undersampling the positive instances
of the data set, so that a good classifier can be learned in a
highly unbalanced domain. The authors use the proposed
GSVM-BA to classify IPs as Spam IPs or non-Spam
IPs (i.e., to learn IP reputation). They used two types of
aggregate features, both derived from the sender's
IP, the receiver's IPs, and the sending time. Their experiments
showed a very high precision of 99.87% with a recall
of 47%.

Sender reputation mechanisms are not only limited to 
network-level features; reputation can also be learned 
from a much higher communication level, such as the 
social network level. The social-network-based filtering 
approach takes advantage of the natural trust system of 
social networks. For example, if the sender and the re- 
ceiver belong to the same social group, the communi- 
cation is probably legitimate. On the other hand, if the 
sender does not belong to any trustworthy social group, 
he is more likely to be blacklisted. There are many methods
which make good use of social networks to create
both black and white lists. For example, J. Balthrop et
al. [4] used email address books to compute a sender trust
degree.

Boy kin and Roychowdhury Q extract social net- 
work's features from the email header fields such as 
From, To, and CC, from which they construct a social 
graph, as can be observed from a single user's perspective.
Later, they find clusters of users who can be regarded
as trusted. Finally, they train a classifier on the email
addresses in the white list and black list. The authors
reported 56% TPR with the black list and 34% TPR with
the white list. Their method, empirically tested on three
data sets of several thousand emails, did not have any
false positives. The downside of the proposed algorithm
is that both the black and white lists can be made
erroneous. For example, the black list can be fooled by
attackers who use spyware to learn the user's frequently
used address list and have one or more of the addresses added to
the Spam mail so that the co-recipients (the Spam victims)
will look like they belong to the user's social network.
The white list can also be fooled by sending Spam
mail using one or more of the user's friends' accounts.
This can be done, for example, if the friend's computer
has been compromised by a bot which selectively sends
Spam mails.

A supervised collaborative approach for learning the 
reputation of networks is presented by Golbeck and 
Hendler fTTTl . The proposed mechanism is aided by the 
user's own scores for email senders. For example, a user
can assign a high reputation score to his closest friends,
and they in turn may assign a high reputation rank to
their friends. In this way, a reputation network is created.
The reputation network may be used as a white list, as a
recommendation system with a very low false positive rate,
or to allow the user to sort emails by reputation score. A
similar approach is presented by Xie and Wang [32]. They
focus on collaboration among several email domains in
order to increase the coverage of senders. Each provider
compares the email histories obtained from its peers via
the proposed Simple Email Reputation Protocol (SERP)
with its own records in order to establish the trustworthiness
of the received data.

Beverly and Sollins [5] investigated the effectiveness
of using transport-level features, e.g., round trip time,
FIN count, jitter, and several more. The best features
were selected using the forward fitting method to train an
SVM-based classifier. They reported 90% accuracy on
60 training examples. However, one of the weaknesses
of their method when compared to RBL and other
reputation mechanisms is that emails (both Spam and Ham)
must be fully received in order to extract their features,
thus making the Spam mitigation process less effective.

In addition to the above-mentioned SRMs, there is
another line of research that focuses on inferring the
reputation of senders from spatial and temporal features
extracted from email logs.

SNARE, by Hao et al. [13], is a method based
solely on network-level and geodesic features, such as
distance in IP space to other email senders or the
geographic distance between sender and receiver. SNARE's
classification accuracy has been shown to be comparable
with existing static IP blacklists on data extracted from
McAfee's TrustedSource system [19].

Liu [18] proposed a policy that assigns reputation to
senders according to the results returned by a content-based
filter. The author proposes several simple rules
that, nevertheless, are able to reduce the number of
uncaught low-volume spammers and to accurately select the
set of mixed senders such that the ham/Spam ratio is
maximized. Finally, West et al. [30] introduce a reputation
model named PreSTA (Preventive Spatio-Temporal
Aggregation) that is used to predict the behavior of potential
spam senders based on email logs. The authors report
that PreSTA identifies up to 50% of the spam emails that have
passed the blacklist while having 0.66% false positives.

In the current work, we report more than 94% true pos- 
itive detection rate and up to 0.6% false positives in a 
similar scenario with one week of email logs obtained 
from a large email service provider. 

4 Learning From the Email Log 

The data set used in this research contained 168 hours
(7 days) of anonymized email logs obtained from a
large email service provider. The provider maintains its
own Spam mitigation solution, which includes black and
white lists and a CBF. The email log includes only emails
that have passed the blacklist and are labeled by a CBF
named eXpurgate [10]. The developers of eXpurgate
claim "A Spam recognition rate of over 99%" and "zero
false positives", with an unpublished false negative rate.

The email log was parsed to create a relational data set
in the following way. Let IP denote the Internet address
of the sender MTA. In the following discussion we will
assume an implicit partition of the IP address into four
fields expressed in CIDR (Classless Inter-Domain Routing)
notation: the subnet identifiers IP/8, IP/16, IP/24, and the
host identifier IP/32.
Let $EL = \{IP, T, NR, AE, PT, SpamClass\}^m$ be the
relational data set representing the email log. It contains one
line for each received mail, where T is the receiving time,
NR is the number of recipients, AE is the number of
addressing errors, PT is the time spent by eXpurgate for
processing the email, SpamClass is the binary mail
classification (Spam = 1 or Ham = 0) obtained from eXpurgate,
and m is the number of emails. Table 4 presents
the properties of the EL data set used in the experiment.
Table 2 is an example of EL. This example will be used
to demonstrate the construction of the HDS in Section 5.
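
As an illustration of the EL schema and of the implicit partition of the sender address, the following Python sketch may help. The class and field names are hypothetical (the source does not describe its parser), and the time and processing-time units are assumptions.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class ELRecord:
    """One email log line; field names mirror the EL attributes (hypothetical types)."""
    ip: str          # sender MTA address (IP)
    t: int           # receiving time (T), unit assumed
    nr: int          # number of recipients (NR)
    ae: int          # number of addressing errors (AE)
    pt: float        # CBF processing time (PT), unit assumed
    spam_class: int  # 1 = Spam, 0 = Ham


def subnet_identifiers(ip):
    """Derive the IP/8, IP/16, IP/24 and IP/32 identifiers of an IPv4 address."""
    return {
        f"IP/{bits}": str(ipaddress.ip_network(f"{ip}/{bits}", strict=False))
        for bits in (8, 16, 24, 32)
    }
```

For the address 10.20.30.40 the four identifiers are 10.0.0.0/8, 10.20.0.0/16, 10.20.30.0/24, and 10.20.30.40/32, which a learner can treat as four separate categorical features.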

We used machine learning (ML) algorithms to create
sender reputation models based on EL. As can be
seen from Table 3, the subnet identifiers play a significant
role in ML models based on EL. This fact suggests that
ML algorithms are able to identify spamming addresses
even based on emails that have passed the blacklist of the
email service provider. We will denote the model built by
ML from the EL data set as the EL-Based SRM.

Let BLT and WLT be the blacklisting and the
whitelisting thresholds, respectively. If the EL-Based
SRM classifies an EL instance as Spam with confidence
above BLT, the respective IP address is added to the
black-list. The white-list is updated symmetrically. A
similar approach, using a different set of features, was
taken in [13]. In Section 7 we will evaluate the effectiveness
of the EL-Based SRM for updating the black and the
white-lists and use it as a baseline for comparison with the
proposed SRM. It should be noted that the performance
of the EL-Based SRM roughly matches the performance
of state-of-the-art methods.

Table 2: Example of an email log

Table 3: Infogain ranking of EL features.

Table 4: The email log data set

Another simple SRM which can be applied directly
to the EL data set is a heuristic that detects repeated
spammers and trusted legitimate email senders. This
Heuristic-SRM is based solely on the fraction of Spam
mails sent by the IP in the past. Let $[T_{start}, T_{end})$ denote
a continuous time range starting at $T_{start}$ (inclusive) and
ending at $T_{end}$ (exclusive).

Definition 1 Spammingness of a sender IP (denoted by
$Y_{IP,[T_{start},T_{end})}$) is the fraction of Spam emails sent by the
IP during the time window $[T_{start}, T_{end})$.

If the spammingness of an IP in the past is above BLT,
we blacklist it. Symmetrically, if the spammingness of an
IP is below WLT, we put it in the white-list. Since most
of the IP addresses are either legitimate email senders or
potential spammers, this heuristic performs quite well in
practice for large time windows. However, it can hardly
detect changes in sender behavior and therefore cannot
react to them in a timely manner.
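
Definition 1 and the Heuristic-SRM rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log is represented as (ip, time, spam_class) tuples, the default threshold values 0.9 and 0.1 are hypothetical, and the handling of an IP with no emails in the window is not specified in the text.

```python
def spammingness(log, ip, t_start, t_end):
    """Fraction of Spam emails sent by `ip` during [t_start, t_end).

    `log` is an iterable of (ip, receiving_time, spam_class) tuples with
    spam_class 1 for Spam and 0 for Ham. Returns None when the IP sent
    no emails in the window (an assumption; the text leaves this open).
    """
    labels = [s for (addr, t, s) in log if addr == ip and t_start <= t < t_end]
    if not labels:
        return None
    return sum(labels) / len(labels)


def heuristic_srm(log, ip, t_start, t_end, blt=0.9, wlt=0.1):
    """Apply the blacklisting (BLT) and whitelisting (WLT) thresholds."""
    y = spammingness(log, ip, t_start, t_end)
    if y is None:
        return "unknown"
    if y > blt:
        return "blacklist"
    if y < wlt:
        return "whitelist"
    return "undecided"
```

Because the fraction is taken over the whole window, a sender that turns into a spammer late in a long window moves the score only slowly, which is the weakness noted above.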

Aggregations which capture the entire history
of an IP may not be optimal, as they shift the focus away
from the most recent statistics that may indicate behavior
changes. A better approach is to split the history of
an IP into several non-congruent time windows, as
described in the next section.

5 Historical Data Set

The primary objective of the presented work is to au- 
tomatically learn a classification model for updating the 
black and the white lists in a timely manner. We as- 
sume that there are differences in the behavioral patterns 
of spammers and legitimate IPs and try to capture these 
differences. We also assume that the behavioral patterns
of IPs may change over time due to adversarial actions,
such as stealing local network IPs of benign MTAs,
temporarily stealing IPs using BGP hijacking attacks, or
installing a bot agent on an end user's device. For example,
a small business that runs its own legitimate email
servers may be subverted by some malware and become a
Spam sender. In this case the IP may temporarily enter the
blacklists of large email service providers until the problem
is mitigated. We further assume that an analysis of a
long period of time is important for blacklisting repeated
spammers, whereas recent behavior is important for
rapid reaction to changes in sender behavior.

Figure 1: The HDS work flow.

The idea of HDS in a nutshell is to aggregate email
log records across multiple variable-length historical time
windows in order to create informative statistical records.
For each historical time window, HDS records contain
multiple aggregations of the attributes in the EL data set.
A statistical learning algorithm is applied to the aggregated
features to predict the behavior of an IP in the near
future. Based on this prediction, the HDS-based SRM will
temporarily update the black and white lists.

Intuitively, one special property of the proposed 
method is its ability to model the subject behavior over 
time (which is more informative than only its current 
state). In particular, it extracts features for a sender MTA 
by aggregating past transactions, which are then labeled 
using its behavior on future transactions. An HDS-based 
classification model, therefore, can estimate MTA behav- 
ior in the near future based on its past transactions. This 
information allows detection of 'Spammer' behavior pat- 
terns even with only very few Spam emails sent. This 
unique property can speed-up the blacklisting process, 
and hence improve the TPR. 

Next, we define the HDS building blocks. Figure 1
depicts the general work flow of the HDS-based SRM. HDS
records are uniquely identified by a reference time $T_0$ and
an IP. EL records preceding $T_0$ are denoted with negative
time indexes, e.g., $T_{-1}$. A historical time window is a
continuous range $[T_{-w_0 \cdot 2^i}, T_0)$ where $w_0$ is the length of
the smallest historical time window and $i$ is a non-negative
integer (for $i = 0$ the window has length $w_0$). Using
exponentially growing historical time windows
gives us two benefits. First, we are able to capture
a long history of an IP without exploding the size of the
HDS records. Second, the size of the most recent time
windows is small enough to detect the slightest change in
the behavior of an IP. In Section 7 we will show that the
number of historical time windows should be carefully
chosen in order to obtain the best performance. Choosing
the best length of the smallest time window ($w_0$) is
not intuitive either; however, this question is not covered in this
paper.
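
The exponentially growing windows can be enumerated with a short Python sketch. The function name is hypothetical, and the window index is assumed to run from 0 to n-1 so that the smallest window has exactly the smallest length (w0 in the text).

```python
def historical_windows(t0, w0, n):
    """Return the n historical windows [t0 - w0 * 2**i, t0) for i = 0..n-1.

    Each window is a (start, end) pair with an inclusive start and an
    exclusive end; all windows share the same end, the reference time t0.
    """
    return [(t0 - w0 * 2 ** i, t0) for i in range(n)]
```

With a smallest window of 60 time units and n = 5, the longest window spans 60 * 2**4 = 960 units, so five feature sets cover sixteen times the history of the shortest window.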

Let $FS_{IP,T_0,i}$ be a set of aggregated features of a particular
IP, computed using EL records in a historical time
window $[T_{-w_0 \cdot 2^i}, T_0)$. Each HDS record contains $n$ feature
sets $(FS_{IP,T_0,0}, FS_{IP,T_0,1}, \ldots, FS_{IP,T_0,n-1})$. Every feature
set $FS_{IP,T_0,i}$ includes aggregates of all features extracted
from the email logs. The actual set of features
depends on the email service provider and the nature of
the email logs. In this paper we have constructed feature
sets for the HDS records by taking the sum, mean, and
variance of the number of recipients (NR), the number of
addressing errors (AE), the CBF processing time (PT),
and the SpamClass. Note that the mean of SpamClass is
in fact the spammingness of the IP in the respective time
window. In addition, for each time window we have also
included the total number of emails sent and the number
of times the sender changed its behavior from sending
Spam to sending legitimate emails, and vice versa. The
last feature plays a significant role in the classification of
senders, as can be seen from Table 5.
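
The aggregation for a single historical time window might look like the following Python sketch. It is an illustration only: records are assumed to be dictionaries sorted by receiving time, the output names follow the labels used in Table 5, and the use of population variance and of 0.0 for empty windows are assumptions, since the text does not specify either.

```python
from statistics import mean, pvariance


def feature_set(window_records):
    """Aggregate the EL records of one IP in one historical time window.

    `window_records` is a list of dicts with keys NR, AE, PT and SpamClass,
    sorted by receiving time.
    """
    fs = {"Emails_Sent": len(window_records)}
    for name in ("NR", "AE", "PT", "SpamClass"):
        values = [r[name] for r in window_records]
        fs[f"{name}_Sum"] = sum(values)
        fs[f"{name}_Mean"] = mean(values) if values else 0.0
        fs[f"{name}_Variance"] = pvariance(values) if values else 0.0
    # Count switches between Spam and Ham in consecutive emails.
    labels = [r["SpamClass"] for r in window_records]
    fs["Behavior_Changes"] = sum(a != b for a, b in zip(labels, labels[1:]))
    return fs
```

An HDS record would then hold one such dictionary per historical window, all computed relative to the same reference time.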

Definition 2 Erraticness of a sender IP (denoted by
$Z_{IP,[T_{start},T_{end})}$) is the number of times the sender changed its
behavior from sending Spam to sending legitimate emails
and vice versa during the time window $[T_{start}, T_{end})$.

Table 5: Infogain ranking of HDS features ($n = 5$, $w_0 = 60$ min).

                      Historical Time Windows
Aggregate Features    1       2       3       4       5
SpamClass_Mean        0.1123  0.2135  0.3561  0.5116  0.7118
NR_Mean               0.0842  0.1592  0.2645  0.3766  0.5102
SpamClass_Sum         0.0886  0.1634  0.2613  0.3528  0.4509
NR_Sum                0.0668  0.1212  0.1942  0.2663  0.3518
PT_Mean               0.0392  0.0769  0.1308  0.1840  0.2465
PT_Sum                0.0363  0.0695  0.1164  0.1664  0.2289
Emails_Sent           0.0144  0.0323  0.0639  0.1034  0.1553
PT_Variance           0.0146  0.0329  0.0639  0.1006  0.1429
NR_Variance           0.0034  0.0085  0.0219  0.0438  0.0718
SpamClass_Variance    0.0015  0.0038  0.0086  0.0162  0.0293
Behavior_Changes      0.0013  0.0032  0.0068  0.0119  0.0191
Total:                0.0420  0.0804  0.1353  0.1940

The goal of the Erraticness method is to detect MTAs
that are about to change their behavior to 'Spamming' and
will stay that way for a long period of time, i.e., have
a small Erraticness value. Note that the underlying
notion of Erraticness, namely that an MTA's sending behavior
can alternate (an MTA can send ham emails for a while, then
send some Spam emails for a while, and later go back
to sending ham emails once again), is indeed realistic. In
fact, our EL data set (discussed in Section 6.2) shows that
25.6% of the MTAs changed their sending behavior during
a single day (8,804 IPs changed their behavior more
than 4 times). In Section 7 we show that the HDS-Based
(Erraticness) SRM can learn to avoid blacklisting such MTAs
and thereby avoid false positives.

Let $[T_0, T_{Pred})$ be the prediction time window, where
$Pred$ is its length. Let $Class_{IP,[T_0,T_{Pred})}$ be the target
attribute of the HDS records used to train machine learning
classifiers. HDS records are identified by an IP and a
reference time $T_0$. They contain $n$ feature sets that
correspond to $n$ historical time windows and a target attribute.
Let HDS be a relational data set derived from EL:

$HDS = (IP, T_0, FS_{IP,T_0,0}, \ldots, FS_{IP,T_0,n-1}, Class_{IP,[T_0,T_{Pred})})^l$

where $l$ is the number of records.

Table 6 depicts the historical data set structure. Each
HDS record is identified by an IP and a reference time $T_0$
and contains $n$ feature sets.

Next, we describe two variants of an HDS-Based
SRM. The target attribute of the first variant is the
future Spammingness of an IP and the target attribute of
the second variant is its future Erraticness. We will
denote these two variants as HDS-Based (Spammingness)
SRM and HDS-Based (Erraticness) SRM, respectively.

Table 6: The HDS structure

5.1 HDS-Based (Spammingness) SRM

In order to train the HDS-Based (Spammingness)
classifier, we set the target attribute of every HDS record to be
the Spammingness of the IP in a time period following $T_0$
($Y_{IP,[T_0,T_{Pred})}$). Table 7 shows an example of a historical
data set derived from the email log in Table 2. The HDS
contains one instance per IP per time unit, where the time unit
is defined as the size of the smallest time window ($w_0$).
In this example, $w_0 = 1$, $n = 4$, and $T_{Pred} = 4$. EC stands
for email count and AE stands for the sum of addressing
errors. Both are examples of aggregated features that are
calculated for each one of the four feature sets, while
Spammingness is calculated according to Definition 1. In our
experiments, each feature set contained 13 different
features (see Table 5).
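
To make the labeling step concrete, the following Python sketch builds one simplified HDS record. It is a reduced illustration, not the full record of the paper: each feature set is shrunk to a single feature (the historical fraction of Spam), the log is assumed to be (ip, time, spam_class) tuples, and the defaults mirror the example parameters (smallest window 1, four windows, prediction length 4).

```python
def fraction_spam(log, ip, start, end):
    """Fraction of Spam among the emails of `ip` in [start, end), or None if empty."""
    labels = [s for (addr, t, s) in log if addr == ip and start <= t < end]
    return sum(labels) / len(labels) if labels else None


def hds_record(log, ip, t0, w0=1, n=4, pred=4):
    """Build a simplified HDS record for `ip` at reference time `t0`.

    Features come only from emails before t0 (the historical windows);
    the target comes only from emails in the prediction window [t0, t0 + pred).
    """
    features = [fraction_spam(log, ip, t0 - w0 * 2 ** i, t0) for i in range(n)]
    target = fraction_spam(log, ip, t0, t0 + pred)
    return {"IP": ip, "T0": t0, "features": features, "target": target}
```

The split at the reference time is the essential design point: the classifier sees only past aggregates and is trained to reproduce what the sender did afterwards.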

Given the HDS with a target attribute set to Spammingness,
we can train a machine learning based classifier to predict
the future Spammingness of IP addresses. These predictions
are used by the HDS-Based (Spammingness) SRM to roughly
distinguish between spammers and non-spammers. If the
predicted Spammingness is higher than a given threshold
(e.g., 0.5), we then apply the BLT threshold on the
Spammingness (i.e., SpamClass_mean) in the largest historical
time window. Symmetrically, if the predicted Spammingness is
lower than the given threshold, the WLT is applied to
determine whether or not the IP is added to the white-list.
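The two-stage rule described above can be written down compactly. The sketch below is illustrative only: the function name and return labels are hypothetical, the default BLT and WLT values are the ones reported in Section 7, and a prediction exactly equal to the threshold is treated as "lower" by assumption.

```python
def spammingness_decision(predicted, historical, blt=0.5, wlt=0.05, threshold=0.5):
    """Return 'blacklist', 'whitelist', or 'none' for one IP.

    predicted  -- future Spammingness predicted by the classifier
    historical -- Spammingness observed in the largest historical time window
    """
    if predicted > threshold:
        # Predicted spammer: confirm with the blacklisting threshold (BLT).
        return "blacklist" if historical > blt else "none"
    # Predicted non-spammer: confirm with the whitelisting threshold (WLT).
    return "whitelist" if historical < wlt else "none"


if __name__ == "__main__":
    print(spammingness_decision(predicted=0.9, historical=0.8))
    print(spammingness_decision(predicted=0.1, historical=0.01))
```

An IP is therefore listed only when the prediction and the observed history agree; in every other case it stays off both lists.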

The operation of the HDS-Based (Spammingness) SRM resembles
the Heuristic SRM described in Section [4]. Here, however,
the heuristic rule is augmented by the prediction of a
machine learning classifier. In Section [7] we will show that
this combination produces a high quality SRM. Preliminary
experiments (not presented in this paper) show that using the
predicted Spammingness only, without applying the BLT and WLT
thresholds on historical Spammingness, results in poor
classification.

5.2 HDS-Based (Erraticness) SRM 

Another variant of the HDS-Based SRM uses a machine learning
classifier to predict the stability, or Erraticness, of the
IP behavior in the prediction time window. In order to train
the HDS-Based (Erraticness) classifier, we set the target
attribute of every HDS record to be the Erraticness of the IP
in the time period following $T_0$ ($Z_{IP,[T_0,T_{Pred})}$).
Classifiers trained on this data are used in a slightly
different way than classifiers trained to predict the
Spammingness of an IP. The difference is mainly in the rule
we use to arrive at the second part of the black-listing /
white-listing decision process.

First, we check whether the predicted Erraticness after $T_0$
is very close to zero (Erraticness $\approx \varepsilon$),
meaning that the IP is not expected to change its behavior in
the near future. If it is, then the same rule that guides the
Heuristic SRM is applied.

That is, if the Spammingness of the IP in the longest
historical time window is above the blacklisting threshold
BLT, then the IP is blacklisted. Symmetrically, if changes in
the IP behavior are not predicted and the IP Spammingness is
below the whitelisting threshold WLT, then the IP is added to
the white-list.
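A minimal sketch of this rule follows. The function name, the return labels, and the tolerance `epsilon` are hypothetical (the text only requires the predicted Erraticness to be very close to zero), and the absolute value is used on the assumption that Erraticness may be signed; BLT and WLT default to the values reported in Section 7.

```python
def erraticness_decision(predicted_erraticness, historical,
                         epsilon=0.01, blt=0.5, wlt=0.05):
    """Return 'blacklist', 'whitelist', or 'none' for one IP.

    predicted_erraticness -- Erraticness predicted for the period after T0
    historical            -- Spammingness in the longest historical window
    """
    if abs(predicted_erraticness) > epsilon:
        # A behavior change is expected, so the history is not relied upon.
        return "none"
    if historical > blt:
        return "blacklist"
    if historical < wlt:
        return "whitelist"
    return "none"


if __name__ == "__main__":
    print(erraticness_decision(0.0, 0.9))
    print(erraticness_decision(0.4, 0.9))
```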

Experimental results presented in Section [7] show that, in
terms of AUC score, the HDS-Based (Spammingness) SRM
consistently performs better than all the other examined
SRMs. In fact, this SRM achieves the highest performance
metrics and is the most tolerant of configuration changes.

6 Evaluation Methodology 

6.1 Evaluation Environment 

It is not possible to directly compare ML models trained on
the EL and the HDS data sets. The main difficulty is that the
number of instances and the target variable differ between
the two data sets. Each EL instance corresponds to a single
email, while each HDS instance represents email aggregates
for a time period. Therefore, we implemented a unique
evaluation environment around the WEKA machine-learning tool
[12] that allows for an evaluation of the aforementioned SRMs
on common ground. In addition to the HDS-based and the
EL-based SRMs, we implemented a Heuristic SRM. The Heuristic
SRM blacklists IPs that, during a past time window, had a
spam fraction value greater than the BLT value. As opposed to
the HDS and EL SRMs, the Heuristic SRM does not use machine
learning and therefore does not need training. The Heuristic
SRM is currently deployed at the ISP from which our data
originates and is used for updating its black-list. The
evaluation environment was designed to simulate the general
filtering process of incoming emails.

Table 7: Example of an HDS

The HDS evaluation environment (see Figure [3]) contains four
modules: controller, white-list, black-list, and a reputation
mechanism. The sender IP of every incoming email (i.e., an EL
instance) is first looked up in the white-list (steps 1: and
2: in Figure [3]). A positive result causes the email to be
accepted (steps 3: and 9:). EL instances that have been
accepted are passed to the SRM (step 6:) in order to update
the reputation model. Emails arriving from blacklisted MTAs
are rejected without further processing (steps 4:, 5:, and
9:). If neither the white-list nor the black-list contains
the IP, the classification result of the SRM is used to make
the final decision and to update the lists, if necessary
(steps 6:, 7:, 8:, and 9:). Note that in the case of the
HDS-Based and the Heuristic SRMs, emails received from
blacklisted IPs are ignored since they never reach the
reputation mechanism. The general experimental settings are
depicted in Figure [2].
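The lookup order described above can be sketched as a small controller. This is an illustrative model, not the WEKA-based environment used in the experiments: the class names, the `observe`/`classify` interface, and the toy `CountingSRM` are all hypothetical stand-ins for a real reputation mechanism.

```python
class Controller:
    """Filters incoming emails using a white-list, a black-list, and an SRM."""

    def __init__(self, srm):
        self.whitelist = set()
        self.blacklist = set()
        self.srm = srm  # hypothetical object exposing observe() and classify()

    def handle(self, ip, email):
        """Return 'accept' or 'reject' for one incoming email."""
        if ip in self.whitelist:
            self.srm.observe(ip, email)   # accepted emails update the model
            return "accept"
        if ip in self.blacklist:
            return "reject"               # never reaches the SRM
        verdict = self.srm.classify(ip, email)
        if verdict == "blacklist":
            self.blacklist.add(ip)
            return "reject"
        if verdict == "whitelist":
            self.whitelist.add(ip)
        self.srm.observe(ip, email)
        return "accept"


class CountingSRM:
    """Toy stand-in for an SRM: blacklists an IP on its third spam-labeled email."""

    def __init__(self):
        self.spam_seen = {}
        self.observed = 0

    def observe(self, ip, email):
        self.observed += 1

    def classify(self, ip, email):
        if email.get("spam"):
            self.spam_seen[ip] = self.spam_seen.get(ip, 0) + 1
        return "blacklist" if self.spam_seen.get(ip, 0) >= 3 else "none"


if __name__ == "__main__":
    controller = Controller(CountingSRM())
    print([controller.handle("9.9.9.9", {"spam": True}) for _ in range(4)])
```

Because the black-list check happens before the SRM is consulted, emails from an already blacklisted IP never contribute to the reputation model, which matches the note above about the HDS-Based and Heuristic SRMs.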



Figure 2: The HDS evaluation space. [Diagram labels: Dataset;
Sender Reputation Mechanism (EL-Based, HDS-Based);
Classification Mode (Batch Mode, Continuous Mode); Black
Listing Method; Sets (Train and Evaluation, Evaluation-Set
Only).]

In contrast to continuous classification, where each email
received from a non-blacklisted IP is passed to the SRM, the
evaluation environment can also operate in a batch mode. In
this mode the incoming emails are logged, but the reputation
mechanism is activated only once per predefined time period.
After processing the logged emails, the SRM returns two sets
of addresses to the controller. One set contains the
addresses that should be blacklisted and the other contains
the addresses that should be whitelisted. Either set, of
course, can be empty.

Figure 3: HDS evaluation environment

6.2 Dataset 

To evaluate the proposed SRMs, we made use of a single
email-log dataset which contained 9.507 million anonymized
log entries (email headers) of 678,509 distinct IPs, which
were received during a 168-hour (7-day) period at the
T-Online ISP. The dataset comprises 9 attributes and 12.25%
'Spam' labeled instances. The un-received emails, which were
blocked by the T-Online black-list, were not logged and
therefore their headers are not included in the dataset. The
dataset was fully labeled by an automatic content-based
filtering device, eXpurgate (see footnote 1). eXpurgate
claims "A Spam recognition rate of over 99%" and "zero false
positives", with an unpublished false negative rate. The
dataset was partitioned into training and validation sets,
each containing the instances of all the emails sent by 200k
randomly selected sender IPs. The training set (2,835,214
instances) and the validation set (2,864,208 instances) are
mutually exclusive, meaning that IPs that exist in the
training set do not appear in the validation set, and vice
versa.
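A partition of this kind, in which the two sets share no sender IP, can be sketched as follows. The function name, the dictionary-based log entries, and the fixed random seed are assumptions made for illustration; the text does not describe how the random selection was implemented.

```python
import random


def split_by_ip(log, ips_per_set, seed=0):
    """Partition log entries into training and validation sets with disjoint IPs."""
    ips = sorted({entry["ip"] for entry in log})
    random.Random(seed).shuffle(ips)          # random selection of sender IPs
    train_ips = set(ips[:ips_per_set])
    valid_ips = set(ips[ips_per_set:2 * ips_per_set])
    # Every email of a selected IP goes to the set that IP was assigned to.
    train = [e for e in log if e["ip"] in train_ips]
    valid = [e for e in log if e["ip"] in valid_ips]
    return train, valid


if __name__ == "__main__":
    toy_log = [{"ip": f"10.0.0.{i % 10}", "id": i} for i in range(50)]
    train, valid = split_by_ip(toy_log, ips_per_set=4)
    print(len(train), len(valid))
```

Splitting by IP rather than by email keeps all evidence about a given sender on one side of the split, so a model is never evaluated on a sender it has already seen during training.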

6.3 Performance Metrics 

In order to evaluate the sender reputation mechanisms
discussed in this paper, we use the following performance
metrics: classification error rate, true positive rate, false
positive rate, area under the ROC (Receiver Operating
Characteristic) curve, black-list size, and number of
white-list hits. These performance metrics provide enough
information to assess how well the HDS-Based SRM could be
used both to reduce the load on mail servers and to reduce
the number of potential customer complaints.

1 Eleven, eXpurgate Anti Spam,
http://www.eleven.de/overview-antispam.html

Table 8: Summary of the experiment setup.

Classification error rate is the rate of incorrect predic- 
tions made by a classifier and is computed by the follow- 
ing equation: 

$$\mathrm{Error} = \frac{FP + FN}{TP + TN + FP + FN}$$

where TP, TN, FP and FN stand for the number of true 
positive, true negative, false positive, and false negative 
rejections of emails respectively. 

We will use the Area Under the ROC Curve (AUC) 
measure in the evaluation process. The ROC curve is a 
graph produced by plotting the true positive rate (TPR = 
TP /(TP + FN)) versus the false positive rate (FPR = 
FP/(FP + TN)). The AUC value of the best possible 
classifier will be equal to unity. This would imply that it 
is possible to configure the classifier so that it will have 
0% false positive and 100% true positive classifications. 
The worst possible binary classifier (obtained by flipping 
a coin for example) has an AUC of 0.5. The AUC is 
considered as an objective performance metric as it does 
not depend on the specific discrimination threshold used 
by a classifier. 
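As a small worked example of these definitions, the sketch below computes the error rate, TPR, and FPR from confusion-matrix counts and approximates the AUC with the trapezoidal rule over a set of (FPR, TPR) operating points. The function names are illustrative, and the trapezoidal approximation is a common choice rather than necessarily the one used by WEKA.

```python
def rates(tp, tn, fp, fn):
    """Return (error rate, TPR, FPR) from confusion-matrix counts."""
    error = (fp + fn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return error, tpr, fpr


def auc(roc_points):
    """Trapezoidal area under a ROC curve given (FPR, TPR) points."""
    # The end points (0, 0) and (1, 1) belong to every ROC curve.
    pts = sorted(set(roc_points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


if __name__ == "__main__":
    print(rates(tp=90, tn=890, fp=10, fn=10))
    print(auc([(0.1, 0.9)]))
```

With no operating points at all the curve degenerates to the diagonal and the area is 0.5, the coin-flip baseline mentioned above; a single point at FPR 0 and TPR 1 yields an area of 1.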

Black-list size is the number of IPs added to the black-list
during each experiment execution. The black-list size mainly
affects the IP lookup time and the amount of computational
resources spent on its maintenance [24]. Faster lookup times
mean less delay in email delivery, while the computational
resources required to maintain the black-list directly
translate into cost.



The number of white-list hits is an indication of the number
of emails that were delivered without content inspection. A
higher number of white-list hits means that fewer
computational resources are spent on Spam filtering.

7 Experimental Results 

In order to assess the effectiveness of HDS-Based SRMs, we
implemented the HDS-Based (Erraticness), HDS-Based
(Spammingness), EL-Based, and Heuristic SRMs within the WEKA
machine-learning framework [31]. The evaluation environment
presented in Figure [3] was also implemented within WEKA as a
special classifier which uses the SRMs to update the black
and white lists. The training of the EL-Based and HDS-Based
sender reputation algorithms was performed using a cross
validation process. The data sets were ordered
chronologically to preserve the order in which the emails
were received. The same cross-validation procedure was
applied to all tested settings, shown in Table [8].

The blacklisting threshold BLT and the whitelisting threshold
WLT, as described in Sections [4] and [5], are parameters of
the evaluation environment. In the following experiments
these two thresholds were fixed and equal for all SRMs. The
values of BLT and WLT were empirically chosen to be 0.5 and
0.05, respectively. These values ensure low false positive
rates while resulting in relatively high true positive rates.

Both the black and the white lists were empty at the
beginning of the experiments. It is also possible to
initialize the black and white lists by executing the
evaluated SRM on the training data. In this scenario the
address lists already contain valuable information at the
beginning of the testing phase. In this paper, however, we
focus on the more challenging task that involves the
construction of the address lists from scratch.

2 The code, dataset, and operation instructions can be
downloaded from #Anonymized#

7.1 The Value of HDS Aggregations 

The first and most important question in this study is
whether the HDS-based SRM outperforms other related SRMs. To
answer it, we compare the HDS-based SRM to the EL-based SRM
on an identical test bed. We argue that, in general,
HDS-based features are at least as informative as EL-based
features because they are derived directly from them; HDS
features should therefore produce a classification model
superior to the one produced from the corresponding EL
dataset. In this experiment we simulated a condition in which
every email whose sender's IP is not included in either the
black or the white list is classified by the tested SRMs on
arrival. We denote this classification mode as 'continuous
mode'.

In order to increase the reliability of the experimental
results, we evaluated the HDS-Based and EL-Based sender
reputation mechanisms on four different machine-learning
algorithms from different machine-learning families. The
algorithms used were: Naive Bayes (Bayes) [15], C4.5
(Decision Trees) [21], Logistic Regression (Function) [8],
and BayesNet (Bayes). Due to memory constraints, in this
experiment the training and validation sets contained the
emails sent by 50,000 randomly selected IPs.

The HDS instances were generated using the following
parameters: $w_0 = 60$ minutes, $n = 5$, and $T_{Pred} = 60$
minutes. The total history length is:

$$\Delta T = w_0 \cdot 2^{(n-1)} = 960 \text{ minutes (16 hours)}$$

Table [9] presents the results of this experiment. We also
applied the Heuristic SRM as a baseline to compare with the
other techniques. The time window used by the Heuristic SRM
for computing the Spammingness of the IP addresses was set to
960 minutes. The blacklisting and the whitelisting thresholds
were set to BLT = 0.5 and WLT = 0.05, respectively, for all
SRMs.

Judging by the best results for each of the SRMs, we can see
from Table [9] that the HDS-SRM (Erraticness) had the best IP
classification performance. Not far behind, in second place,
is the HDS-SRM (Spammingness). Third place, with a noticeable
AUC difference, goes to the EL-based SRM. The worst
performance was achieved by the Heuristic SRM. Interestingly,
the best results for each SRM were obtained by different
learning algorithms. Moreover, it seems that the EL-based SRM
is much more sensitive to the choice of learning algorithm
than the HDS-Based SRMs. Logistic Regression, which worked
very well for both HDS-SRMs, produced the worst results when
applied to the EL-Based SRM. In this case, very few IPs were
blacklisted by the EL-based SRM, which consequently obtained
a very poor AUC score.

Table 9: Dependability on ML classifiers.

The highest true-positive rate was achieved by the HDS-SRM
(Spammingness), whereas the lowest false-positive and error
rates were obtained by the HDS-SRM (Erraticness). The
EL-based SRM blacklisted the most IPs. However, many of these
IPs belonged to benign senders, which is reflected in the
very high false-positive rate obtained by this SRM.

Finally, it is noticeable that in some configurations the 
very simple Heuristic and the EL-based SRMs achieved 
comparable performance. We also noticed that both the 
Heuristic SRM and the EL-Based SRM roughly match 
the reported state-of-the-art performance. 

7.2 Batch IP Classification 

In the previous subsection we discussed the continuous
classification mode, where the sender reputation is computed
and updated each time a new email is received. This mode of
operation may not be realistic due to the relatively high
resource consumption of machine learning based classifiers
compared to black and white list data structures. Moreover,
classifying every incoming email would also mean placing the
classifier in the critical path and turning it into a
bottleneck in the handling of incoming emails. Another
drawback of the continuous classification mode is that most
black-lists are optimized for fast information retrieval but
do not tolerate frequent updates. Updating the black-list
data structure may be a very expensive operation in terms of
computational resources [24].

It is therefore good practice to minimize the number of
updates and to make them as infrequent as possible. In
practice, the SRM can be activated only periodically in order
to save computational resources. The price of periodic
activation of the SRM is a window of opportunity during which
spammers who are not yet blacklisted can send large amounts
of Spam without being blocked.
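The window of opportunity created by periodic activation can be illustrated with a small replay loop. Everything here is a toy model under stated assumptions: `run_batch` and `toy_srm` are hypothetical names, the log holds only (time, ip) pairs, and the stand-in SRM blacklists any IP that sent at least two logged emails since the previous activation.

```python
def run_batch(emails, srm_batch, period):
    """Replay time-ordered (time, ip) pairs, activating the SRM once per `period`.

    Returns the final black-list, white-list, and the number of emails that
    arrived from IPs that were not blacklisted at arrival time.
    """
    blacklist, whitelist, log = set(), set(), []
    next_run, unblocked = period, 0
    for time, ip in emails:
        while time >= next_run:
            to_black, to_white = srm_batch(log)   # one batch activation
            blacklist |= to_black
            whitelist |= to_white
            log, next_run = [], next_run + period
        if ip in blacklist:
            continue                              # rejected, not logged
        log.append((time, ip))
        unblocked += 1
    return blacklist, whitelist, unblocked


def toy_srm(log):
    """Toy batch SRM: blacklist IPs with at least two emails in the log."""
    counts = {}
    for _, ip in log:
        counts[ip] = counts.get(ip, 0) + 1
    return {ip for ip, c in counts.items() if c >= 2}, set()


if __name__ == "__main__":
    burst = [(t, "6.6.6.6") for t in range(10)]   # one email per minute
    for period in (2, 5):
        print(period, run_batch(burst, toy_srm, period)[2])
```

In this toy replay a burst of one email per minute gets two messages through when the SRM runs every 2 minutes and five when it runs every 5 minutes, which is the trade-off between update cost and exposure described above.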



Figure 4: FPR as a function of time between consecutive
address-lists updates. [Plot: logarithmic vertical axis;
horizontal axis: time between successive batch
classifications (min.), 0.5 to 64; series: HDS-Based SRM
(Erraticness), HDS-Based SRM (Spammingness), EL-Based SRM,
Heuristic SRM.]

Figure 5: TPR as a function of time between consecutive
address-lists updates.

In order to investigate the impact of the update frequency of
the black and white lists on the accuracy of Spam filtering,
we executed the reputation mechanisms in a batch mode with
various update frequencies. The SRMs were executed every $k$
minutes, where $k$ was set to 0.5, 1, 2, 5, 15, or 60
minutes. The HDS parameters were: $w_0 = 60$ minutes,
$n = 5$, and $T_{Pred} = 60$ minutes. In this experiment we
used the BayesNet algorithm to train classifiers for both the
HDS-Based SRMs and the EL-Based SRM.

The results presented in Figures [4], [5], and [6]
demonstrate the superiority of HDS-Based SRMs in batch mode
as well.

Figure [4] shows that the HDS-SRM (Erraticness) has the
lowest false reject rate (FPR), while the EL-based SRM has
the highest. Looking at the predictability results (TPR), we
see that the Heuristic SRM had the worst results, whereas the
HDS-based SRM (Spammingness) had the best results among the
SRMs. Figure [6] shows that, for all update frequencies, the
HDS-SRM (Spammingness) has a considerably higher AUC than the
Heuristic and EL-based SRMs. Since the AUC is an objective
performance metric that does not depend on a configurable
threshold, it better reflects the superiority of the HDS over
the other tested SRMs.

Figure 6: AUC as a function of time between consecutive
address-lists updates.

In general, the performance of all four SRMs deteriorated, at
more or less the same rate, as the update frequency of the
black and white lists was reduced. This is well demonstrated
by the drop in predictability (TPR) and AUC scores.
Interestingly, the SRMs' false-positive rates were not
sensitive to the update rate and tended to stay constant.

7.3 The Effect of History Length 

The evidence presented in the previous subsections suggests
that aggregating sender MTA behavior over time is worthwhile
and yields good classification models. The models created
from the HDS when trying to predict the Spammingness of IP
addresses are the least sensitive both to the choice of the
machine learning algorithm and to the frequency of
address-lists updates. We therefore focus this subsection on
the HDS-Based (Spammingness) SRM, investigating its
performance as a function of the number of time windows used
to construct the HDS.

We compared multiple models of the HDS-Based (Spammingness)
SRM trained on different HDS training sets that were
constructed from one to fourteen time windows. The models
were induced by the BayesNet algorithm on HDS instances,
generated for every incoming email using: $w_0 = 15$ seconds,
$n = 1, \ldots, 14$, and $T_{Pred} = 60$ minutes. Both the
black and the white lists were cleared once per 1,440 minutes
(1 day).

The experiment shows a mixed trend in the classification
performance as a function of the number of historical time
windows. The results presented in Figures [7] and [8] and
Table [10] show that the performance of the HDS-Based SRM
improves up to the ninth historical time window, whereas,
when more historical time windows are used, the AUC score
ceases to improve, indicating a change of trend.

Table 10: The effect of history length

Figure 7: The effect of history length on the SRMs'
classification performance

Figure 8: The effect of history length on the black and
white lists



Note that the experiment settings were not optimal w.r.t. the
initial historical time window ($w_0 = 15$ s), resulting in a
lower TPR and a higher FPR than the respective performance
metrics reported in Section 7.1. Yet, this experiment
illustrates the effect of the number of time windows on the
performance of the HDS-based SRM. Due to the very small
initial historical time window (i.e., only 15 seconds) we are
able to notice the decrease in false positive and error rates
as more time windows were used. Surprisingly, the true
positive rate also decreased gradually as the number of
historical time windows grew. This can be explained both by
the decrease in the black-list size and by the growing number
of features, which added dimensions to the machine-learning
problem and therefore made it more and more complex to learn
from (a.k.a. "the curse of dimensionality"). Notice that the
FPR decreased faster than the TPR as the number of historical
time windows increased.

The black-list size declined at a constant rate as more
historical time windows were used. Note that the decline in
the black-list size probably affected (i.e., reduced) both
the TPR and the FPR, since fewer IPs were classified as
spammers. In contrast to the black-list, the average size of
the white-list increased until the tenth time window, where
the trend reversed and an accelerated decline in size began.
Interestingly, the number of white-list hits remained more or
less stable from the 9th to the 14th time window, even when
the average white-list size dropped. Since the AUC score
during these time windows (9th to 14th) was constant, we
conclude that for the whitelisting task the number of history
time windows should be greater than nine. Overall, we see
that as more history is used beyond the 9th time window, the
effectiveness of the black and white lists increases.

8 Discussion and Future Work

8.1 IP Classification Performance

Experimental results presented in this paper show that
aggregating the behavior of MTAs over time is an effective
way to elicit valuable information from email logs. The
proposed method was found to be more effective than
email-log-based and heuristic-based SRMs tested under the
exact same conditions. In fact, the ISP whose dataset we
evaluated uses a blacklisting method which is similar to the
Heuristic SRM, while the Email-log-based SRM is similar to
state-of-the-art methods [23], [13], [5].

The same machine learning algorithms produce much more
effective models when applied to the HDS than when applied to
the non-aggregated data extracted from the raw email logs.
The best results were obtained using the HDS-Based
(Erraticness) SRM with nine time windows: AUC 0.907, TPR
90.7%, and FPR 0.4%. To the best of our knowledge these
results are better than those of previously reported SRMs
evaluated on data sets of similar scale. In fact, the
accuracy of HDS-based SRMs approaches the accuracy of
content-based filtering. Another interesting fact is that the
HDS-based SRM blacklisted roughly the same number of IPs as
the EL-based SRM, while incurring far fewer classification
errors.

Some related works (e.g., [13]) reported roughly the same
performance as the EL-Based SRM reported here. Despite the
differences in the particular features, we believe that
aggregations over multiple time windows can boost the
performance of sender reputation mechanisms based on
statistical learning. On our data set, upgrading the EL-based
SRM to an HDS-based SRM resulted in improved performance. We
suggest that our results are general enough to motivate
"upgrading" EL to HDS on other data sets too.

In the second experiment we studied the impact of periodic
execution of the SRM on its effectiveness. The results show a
clear trade-off between batch size and the effectiveness of
Spam filtering. A less frequent execution of the SRM results
in lower predictive power, as expected. The main reason for
inefficient blacklisting when sender reputation is computed
once in a long time period is the tendency of spamming bots
to send a number of Spam emails during a very short time
period and go silent afterwards [22, 14]. The observed
deficiency of periodic activation of the evaluated SRMs could
also be explained by a reduced marginal benefit when compared
to the email service provider's own SRM, which is itself
activated only periodically.

In Section 7.3 we studied the influence of history length on
the performance of the reputation classifier. The results
show that, in general, the longer the history used, the
better the resulting classification model.

Increasing the number of time windows (and hence the number
of features) above a certain point resulted in increasing
efficiency of both the black and the white lists, and thus in
a decreased workload for the entire filtering system. At the
same time, instead of growing further, the AUC remained more
or less constant. This phenomenon probably occurs due to the
"curse of dimensionality" effect, in which the growing
dimensionality of the dataset plays a negative role, making
the learned concept more and more complex.

In order to capture a longer mail sending history using fewer
historical time windows, the size of the smallest historical
time window $w_0$ can be increased. Unfortunately, in this
case the most recent behavior of the MTAs would be diluted,
damaging the ability of the HDS-based SRM to respond to
sudden behavioral changes. The "curse of dimensionality"
phenomenon can also be tackled by selecting the most
informative features. Currently, we leave the optimal
configuration of HDS construction as an open issue.

Computing the reputation of a sender IP using the HDS-based
SRM is a computationally intensive task due to both the
construction of HDS records and the classification using
machine learning models. The HDS construction could be
optimized by using past HDS records to compute the aggregated
features of a new one. This should be studied further in
future work. On the machine-learning end, to lower the
overall computational requirements of the HDS-based SRM we
suggest using simple classification models, e.g., Naive Bayes
or BayesNet.

8.2 Reducing the Filtering Workload With HDS

To ensure that end-users receive only very few spam mails,
ISPs usually employ a filtering mechanism based on black and
white lists. These lists need to be updated frequently to
keep filtering errors minimal. Currently, there are three
methods for updating these lists: real-time black listing
(RBL), SRMs, and content-based filters (CBFs). SRMs and CBFs
are much slower and more computationally demanding than the
black and white lists. However, the CBFs and SRMs only
process emails that were not matched by the black and white
lists. Thus, as the hit ratio of the black and white lists
increases, fewer emails need to be processed by the CBFs and
SRMs, and the filtering process becomes more computationally
efficient. While very high black and white list hit ratios
make filtering very computationally efficient, they can also
result in high filtering errors, so there is a constant
trade-off between filtering efficiency and accuracy. In order
to increase both the computational efficiency and the
accuracy of filtering, we propose using HDS-based SRMs as a
method for updating both the black and the white lists.

In our experiments the black and the white lists had on
average 457,120 and 1,916,636 hits, respectively. Each hit
corresponded to an email from a spam or benign MTA that was
not put through a content-based filter. Therefore, the
filtering workload decrease (FGain) is:

$$FGain = \frac{\text{black and white lists hits}}{\text{emails in TestSet}} = \frac{2{,}373{,}756}{2{,}864{,}208} \approx 0.829 = 82.9\%$$

This means that more than 4 out of 5 emails hit one of the
lists and therefore skipped the content-based filter, thanks
to the black and white listing of the HDS-based SRM. This is
a very significant contribution for ISPs, as their entire
filtering workload, and consequently the energy consumed
during the email filtering process, can be reduced
significantly.
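The ratio can be recomputed directly from the hit counts quoted above. The variable names below are illustrative; the three numbers are the ones given in the text.

```python
# Average list hits and validation-set size as reported in the text.
black_hits = 457_120
white_hits = 1_916_636
test_set_emails = 2_864_208

# Fraction of emails that hit one of the lists and skipped content filtering.
f_gain = (black_hits + white_hits) / test_set_emails

print(f"list hits: {black_hits + white_hits:,}")
print(f"FGain: {f_gain:.3f} ({f_gain:.1%})")
```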
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